Towards Bidirectional Hierarchical Representations for Attention-Based Neural Machine Translation
This paper proposes a hierarchical attentional neural translation model which
focuses on enhancing source-side hierarchical representations by covering both
local and global semantic information using a bidirectional tree-based encoder.
To maximize the predictive likelihood of target words, a weighted variant of an
attention mechanism is used to balance the attentive information between
lexical and phrase vectors. Using a tree-based rare word encoding, the proposed
model is extended to sub-word level to alleviate the out-of-vocabulary (OOV)
problem. Empirical results reveal that the proposed model significantly
outperforms sequence-to-sequence attention-based and tree-based neural
translation models on English-Chinese translation tasks.
Comment: Accepted for publication at EMNLP 2017
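As a rough illustration of the weighted attention described above, the sketch below gates between a lexical (word-level) context vector and a phrase (tree-node) context vector. The dot-product scoring, the scalar gate, and all shapes are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (assumptions, not the authors' code): attend separately over
# lexical and phrase annotations, then gate between the two context vectors.
import torch
import torch.nn.functional as F

def weighted_tree_attention(decoder_state, word_annotations, phrase_annotations, w_gate):
    """decoder_state: (d,); word_annotations: (n_w, d);
    phrase_annotations: (n_p, d); w_gate: (d,), a learned gating vector."""
    # Attention distribution over word-level (lexical) vectors.
    c_word = F.softmax(word_annotations @ decoder_state, dim=0) @ word_annotations
    # Attention distribution over phrase-level (tree-node) vectors.
    c_phrase = F.softmax(phrase_annotations @ decoder_state, dim=0) @ phrase_annotations
    # A scalar gate balances the attentive information between the two.
    g = torch.sigmoid(w_gate @ decoder_state)
    return g * c_word + (1.0 - g) * c_phrase
```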
Assessing the Ability of Self-Attention Networks to Learn Word Order
Self-attention networks (SANs) have attracted a lot of interest due to their high parallelization and strong performance on a variety of NLP tasks, e.g. machine translation. Because they lack a recurrence structure such as that of recurrent neural networks (RNNs), SANs are assumed to be weak at learning the positional information of words for sequence modeling. However, this speculation has neither been empirically confirmed, nor has it been explained why SANs perform strongly on machine translation while "lacking positional information". To this end, we propose a novel word reordering detection task to quantify how well word order information is learned by SANs and RNNs. Specifically, we randomly move one word to another position and examine whether a trained model can detect both the original and inserted positions. Experimental results reveal that: 1) SANs trained on word reordering detection indeed have difficulty learning positional information, even with position embeddings; and 2) SANs trained on machine translation learn better positional information than their RNN counterpart, in which position embeddings play a critical role. Although a recurrence structure makes the model more universally effective at learning word order, learning objectives matter more in downstream tasks such as machine translation.
Comment: ACL 2019
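The detection task is easy to reproduce in data form; a minimal sketch of building one example as the abstract describes follows, with the exact labeling convention (e.g., how positions shift after extraction) left as an assumption.

```python
# Minimal sketch: randomly move one word to another position and keep both
# the original and the inserted positions as detection targets.
import random

def make_reordering_example(tokens, rng=random):
    perturbed = list(tokens)
    i = rng.randrange(len(perturbed))      # original position of the moved word
    word = perturbed.pop(i)
    j = rng.randrange(len(perturbed) + 1)  # position where it is re-inserted
    perturbed.insert(j, word)
    return perturbed, i, j

sent, orig_pos, new_pos = make_reordering_example("the cat sat on the mat".split())
```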
Context-Aware Self-Attention Networks
Self-attention models have shown their flexibility in parallel computation and their effectiveness in modeling both long- and short-term dependencies. However, they calculate the dependencies between representations without considering contextual information, which has proven useful for modeling dependencies among neural representations in various natural language tasks. In this work, we focus on improving self-attention networks by capturing the richness of context. To maintain the simplicity and flexibility of self-attention networks, we propose to contextualize the transformations of the query and key layers, which are used to calculate the relevance between elements. Specifically, we leverage internal representations that embed both global and deep contexts, thus avoiding reliance on external resources. Experimental results on the WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed methods. Furthermore, we conducted extensive analyses to quantify how the context vectors participate in the self-attention model.
Comment: AAAI 2019
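One plausible reading of "contextualizing the query and key transformations" is sketched below: a mean-pooled global context vector is gated into the inputs of the query and key projections. The gating form is an assumption, not the paper's exact parameterization.

```python
# Minimal sketch (assumptions, not the paper's exact model): queries and keys
# are computed from a gated mix of each token and a global context vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareSelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Gates controlling how much context flows into queries and keys.
        self.gate_q = nn.Linear(2 * d_model, d_model)
        self.gate_k = nn.Linear(2 * d_model, d_model)

    def forward(self, x):  # x: (batch, length, d_model)
        ctx = x.mean(dim=1, keepdim=True).expand_as(x)  # internal global context
        lam_q = torch.sigmoid(self.gate_q(torch.cat([x, ctx], dim=-1)))
        lam_k = torch.sigmoid(self.gate_k(torch.cat([x, ctx], dim=-1)))
        q = self.w_q((1 - lam_q) * x + lam_q * ctx)     # contextualized query
        k = self.w_k((1 - lam_k) * x + lam_k * ctx)     # contextualized key
        v = self.w_v(x)
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v
```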
EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning
Expressing universal semantics common to all languages is helpful for understanding the meanings of complex and culture-specific sentences. The research theme underlying this scenario focuses on learning universal representations across languages using massive parallel corpora. However, due to the sparsity and scarcity of parallel data, learning authentic "universals" for any two languages remains a major challenge. In this paper, we propose EMMA-X, an EM-like Multilingual pre-training Algorithm, to learn (X)Cross-lingual universals with the aid of abundant multilingual non-parallel data. EMMA-X unifies the cross-lingual representation learning task and an extra semantic relation prediction task within an EM framework. Both the extra semantic classifier and the cross-lingual sentence encoder approximate the semantic relation of two sentences, and they supervise each other until convergence. To evaluate EMMA-X, we conduct experiments on XRETE, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that EMMA-X achieves state-of-the-art performance. Further geometric analysis of the built representation space with three requirements demonstrates the superiority of EMMA-X over advanced models.
Comment: Accepted by NeurIPS 2023
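The mutual supervision between the two components can be caricatured as an alternating update; the encoder, the (here binary) relation classifier, and the losses below are placeholders chosen for illustration, not EMMA-X's actual components.

```python
# Highly simplified sketch of EM-style mutual supervision: the classifier
# pseudo-labels pairs for the encoder, and the encoder's similarity in turn
# supervises the classifier. All modules/losses are illustrative assumptions.
import torch
import torch.nn.functional as F

def em_step(encoder, classifier, sent_a, sent_b, enc_opt, cls_opt):
    # E-step: the semantic classifier (assumed 2-class here) pseudo-labels
    # the relation of a non-parallel sentence pair, with gradients blocked.
    with torch.no_grad():
        za, zb = encoder(sent_a), encoder(sent_b)
        pseudo = classifier(torch.cat([za, zb], dim=-1)).argmax(dim=-1)

    # M-step (encoder): push pair similarity toward the pseudo-label
    # (1 = semantically related, 0 = unrelated).
    za, zb = encoder(sent_a), encoder(sent_b)
    sim = F.cosine_similarity(za, zb, dim=-1)
    enc_loss = F.binary_cross_entropy_with_logits(sim, pseudo.float())
    enc_opt.zero_grad(); enc_loss.backward(); enc_opt.step()

    # M-step (classifier): conversely, the encoder's (thresholded) similarity
    # supervises the classifier, so the two models train each other.
    with torch.no_grad():
        target = (sim > 0.0).long()
    logits = classifier(torch.cat([za.detach(), zb.detach()], dim=-1))
    cls_loss = F.cross_entropy(logits, target)
    cls_opt.zero_grad(); cls_loss.backward(); cls_opt.step()
```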
Competency-Aware Neural Machine Translation: Can Machine Translation Know its Own Translation Quality?
Neural machine translation (NMT) is often criticized for failures that occur without its awareness. This lack of competency awareness makes NMT untrustworthy, in sharp contrast to human translators, who give feedback or conduct further investigations whenever they are in doubt about their predictions. To fill this gap, we propose a novel competency-aware NMT model that extends conventional NMT with a self-estimator, giving it the ability both to translate a source sentence and to estimate its own competency. The self-estimator encodes the information of the decoding procedure and then examines whether the model can reconstruct the original semantics of the source sentence. Experimental results on four translation tasks demonstrate that the proposed method not only carries out translation tasks intact but also delivers outstanding performance on quality estimation. Without depending on any reference or annotated data, as typically required by state-of-the-art metric and quality estimation methods, our model yields an even higher correlation with human quality judgments than a variety of such methods, including BLEURT, COMET, and BERTScore. Quantitative and qualitative analyses show the robustness of competency awareness in our model.
Comment: Accepted to EMNLP 2022
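A minimal sketch of the self-estimation idea follows: summarize the decoding procedure, attempt to reconstruct the source semantics from it, and score competency by how well the reconstruction matches the source encoding. The GRU summarizer, mean pooling, and cosine score are illustrative assumptions, not the paper's modules.

```python
# Minimal sketch (assumptions): a self-estimator that scores translation
# competency by reconstructing the source semantics from the decoder's run.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfEstimator(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.summarize = nn.GRU(d_model, d_model, batch_first=True)
        self.reconstruct = nn.Linear(d_model, d_model)

    def forward(self, decoder_states, source_encoding):
        # decoder_states: (batch, tgt_len, d); source_encoding: (batch, src_len, d)
        _, h = self.summarize(decoder_states)   # summary of the decoding procedure
        recon = self.reconstruct(h.squeeze(0))  # attempted source semantics
        src = source_encoding.mean(dim=1)       # pooled source semantics
        # Cosine similarity as the competency (quality) score, one per sentence.
        return F.cosine_similarity(recon, src, dim=-1)
```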
WR-ONE2SET: Towards Well-Calibrated Keyphrase Generation
Keyphrase generation aims to automatically generate short phrases summarizing an input document. The recently emerged ONE2SET paradigm (Ye et al., 2021) generates keyphrases as a set and has achieved competitive performance. Nevertheless, we observe serious calibration errors in ONE2SET's outputs, especially the over-estimation of the ∅ token (meaning "no corresponding keyphrase"). In this paper, we deeply analyze this limitation and identify two main reasons behind it: 1) the parallel generation has to introduce excessive ∅ tokens as padding in the training instances; and 2) the training mechanism that assigns a target to each slot is unstable and further aggravates the over-estimation of the ∅ token. To make the model well-calibrated, we propose WR-ONE2SET, which extends ONE2SET with an adaptive instance-level cost Weighting strategy and a target Re-assignment mechanism. The former dynamically penalizes the over-estimated slots of different instances, thus smoothing the uneven training distribution. The latter refines the original inappropriate assignments and reduces the supervisory signals for over-estimated slots. Experimental results on commonly used datasets demonstrate the effectiveness and generality of our proposed paradigm.
Comment: EMNLP 2022
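One way such an adaptive instance-level cost weighting could look is sketched below, simplified to a single label per slot (ONE2SET actually generates a phrase per slot); the weighting formula is an assumption in the spirit of the abstract, not WR-ONE2SET's actual strategy.

```python
# Minimal sketch (assumptions): down-weight the loss on ∅-target slots more
# strongly for instances whose slots are dominated by ∅ padding.
import torch
import torch.nn.functional as F

def weighted_slot_loss(slot_logits, slot_targets, null_id):
    # slot_logits: (batch, n_slots, vocab); slot_targets: (batch, n_slots)
    loss = F.cross_entropy(slot_logits.transpose(1, 2), slot_targets,
                           reduction="none")           # (batch, n_slots)
    is_null = slot_targets.eq(null_id).float()         # ∅-padded slots
    null_ratio = is_null.mean(dim=1, keepdim=True)     # per-instance ∅ ratio
    # The more an instance is padded, the less its ∅ slots are penalized.
    weights = torch.where(is_null.bool(),
                          1.0 - null_ratio.expand_as(is_null),
                          torch.ones_like(is_null))
    return (weights * loss).mean()
```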
Unsupervised Neural Dialect Translation with Commonality and Diversity Modeling
As a special machine translation task, dialect translation has two main characteristics: 1) it lacks a parallel training corpus; and 2) the two sides of the translation share similar grammar. In this paper, we investigate how to exploit the commonality and diversity between dialects to build unsupervised translation models that access only monolingual data. Specifically, we leverage pivot-private embeddings, layer coordination, and parameter sharing to sufficiently model the commonality and diversity between source and target, ranging from the lexical, through the syntactic, to the semantic level. To examine the effectiveness of the proposed models, we collect a monolingual corpus of 20 million sentences for each of Mandarin and Cantonese, the official language and the most widely used dialect in China, respectively. Experimental results reveal that our methods outperform rule-based Simplified-Traditional Chinese conversion and conventional unsupervised translation models by over 12 BLEU points.
Comment: AAAI 2020
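The pivot-private embedding idea can be sketched as a shared ("pivot") table that models the commonality between the dialects plus a per-dialect ("private") table that models their diversity; the additive combination below is an assumption, not necessarily the paper's exact scheme.

```python
# Minimal sketch (assumptions): shared pivot embedding + per-dialect private
# embedding, summed to form the token representation for each language side.
import torch
import torch.nn as nn

class PivotPrivateEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, n_langs=2):
        super().__init__()
        self.pivot = nn.Embedding(vocab_size, d_model)  # shared across dialects
        self.private = nn.ModuleList(
            [nn.Embedding(vocab_size, d_model) for _ in range(n_langs)]
        )                                               # one table per dialect

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, length); lang_id selects, e.g., Mandarin or Cantonese.
        return self.pivot(token_ids) + self.private[lang_id](token_ids)
```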